Relevant Expressions in Large Corpora

Author

  • J. Ferreira
Abstract

The automatic extraction of statistically relevant expressions of any language from raw (non-annotated) corpora is a very useful task, especially for the study of old texts, given the unavailability of informants, the large number of graphic variants found in such texts, and the scarcity of annotated texts. Furthermore, statistically based extraction of complex lexical units and relevant expressions from non-annotated corpora proves to be more precise and less expensive than the usual methods, which require searching for syntactic patterns in annotated corpora. It is also of interest for linguistic, historical, cultural and literary studies. From the computational point of view, the extraction of those expressions provides the means for incorporating complex lexical units into specialised lexica, which is also very useful for the development of Natural Language Processing (NLP). This paper presents a study of the relevant expressions extracted from a Medieval Portuguese corpus using a statistical method based on three main tools: the LocalMaxs algorithm, a statistical measure, and the Fair Dispersion Point Normalisation concept. In order to assess our results we have tested several statistical measures, namely Specific Mutual Information (SI) (Church K.; K. Hanks (1990)), SCP (Silva J.F.; G.P. Lopes (1999)), the Dice coefficient (Dice L. (1945)), the Loglike coefficient (Dunning T. (1993)), and the φ² coefficient (Gale W.A.; K.W. Church (1991)). We have compared the statistically relevant expressions extracted by these measures and evaluated the results and their relevance both for NLP and for other domains of research, namely Linguistics, History, Cultural and Literary Studies. It should be stressed that the LocalMaxs algorithm and the Fair Dispersion Point Normalisation improved precision by approximately 20%, except for the Dice and Loglike measures.
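As a rough illustration of the kind of association measure the abstract names, the sketch below scores adjacent word pairs with the standard textbook forms of SI (pointwise mutual information), SCP and the Dice coefficient. It is not the paper's implementation: it handles only two-word sequences, it does not include the LocalMaxs algorithm or the Fair Dispersion Point Normalisation, and the function name `bigram_scores` and the choice to estimate all probabilities against the total token count are assumptions made for brevity.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Score each adjacent word pair with SI, SCP and Dice.

    Hypothetical helper for illustration only. Probabilities are
    estimated as count / number of tokens, which is a simplification.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)

    scores = {}
    for (x, y), f_xy in bigrams.items():
        p_xy = f_xy / n
        p_x = unigrams[x] / n
        p_y = unigrams[y] / n
        scores[(x, y)] = {
            # SI: log2 of observed vs. expected-under-independence probability
            "SI": math.log2(p_xy / (p_x * p_y)),
            # SCP: p(x,y)^2 / (p(x) * p(y))
            "SCP": p_xy ** 2 / (p_x * p_y),
            # Dice: 2 * f(x,y) / (f(x) + f(y))
            "Dice": 2 * f_xy / (unigrams[x] + unigrams[y]),
        }
    return scores


if __name__ == "__main__":
    # Toy token list, invented for the example (not from the corpus).
    toy = "el rei dom afonso e el rei dom dinis".split()
    for pair, s in sorted(bigram_scores(toy).items()):
        print(pair, {k: round(v, 3) for k, v in s.items()})
```

Pairs that always occur together score highest on all three measures; the measures differ mainly in how strongly they reward rare pairs, which is one reason a comparison such as the one described in the abstract is informative.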


Similar resources

Lexical Bundles in English Abstracts of Research Articles Written by Iranian Scholars: Examples from Humanities

This paper investigates a special type of recurrent expressions, lexical bundles, defined as a sequence of three or more words that co-occur frequently in a particular register (Biber et al., 1999). Considering the importance of this group of multi-word sequences in academic prose, this study explores the forms and syntactic structures of three- and four-word bundles in English abstracts writte...


Unsupervised Multiword Segmentation of Large Corpora using Prediction-Driven Decomposition of n-grams

We present a new, efficient unsupervised approach to the segmentation of corpora into multiword units. Our method involves initial decomposition of common n-grams into segments which maximize within-segment predictability of words, and then further refinement of these segments into a multiword lexicon. Evaluating in four large, distinct corpora, we show that this method creates segments which c...


AIML Knowledge Base Construction from Text Corpora

Text mining (TM) and computational linguistics (CL) are computationally intensive fields where many tools are becoming available to study large text corpora and exploit the use of corpora for various purposes. In this chapter we will address the problem of building conversational agents or chatbots from corpora for domain-specific educational purposes. After addressing some linguistic issues relev...


Extracting Lay Paraphrases of Specialized Expressions from Monolingual Comparable Medical Corpora

Whereas multilingual comparable corpora have been used to identify translations of words or terms, monolingual corpora can help identify paraphrases. The present work addresses paraphrases found between two different discourse types: specialized and lay texts. We therefore built comparable corpora of specialized and lay texts in order to detect equivalent lay and specialized expressions. We ide...


Mining the Web for Idiomatic Expressions Using Metalinguistic Markers

In this paper, methods for identification and delimitation of idiomatic expressions in large Web corpora are presented. The proposed methods are based on the observation that idiomatic expressions are sometimes accompanied by metalinguistic expressions, e.g. the word “proverbial”, the expression “as they say” or quotation marks. Even though the frequency of such idiom-related metalinguistic mar...



Publication date: 2009